
Make read "connection reset" errors retryable. #2926

Closed
wants to merge 1 commit

Conversation

@emil2k emil2k commented Nov 5, 2019

Commit 2a80185 stopped making send
request errors automatically retryable and deferred to the default
retryer to determine whether a request should be retryable.

Then in commit c3d2710 a distinction
was made between read and write "connection reset" errors with the
former now not being retryable. There is no explanation in the commit
or the PR why the distinction was made, but this change in SDK behavior
is causing sporadic fatal errors for us in production when connecting to
AWS services such as S3 and DynamoDB.

Related: #2908

cc @jasdel

@skmcgrail

Thanks for reaching out to us @emil2k. You are correct that the behavior of the SDK was changed as part of commit c3d2710 to not retry read connection reset errors. The logic behind this change is that the SDK is not able to sufficiently determine the state of the API request after successfully failing to write a request to the service, but failed to read the corresponding response due to a connection reset. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.

It is possible for you to provide a custom retryer that can handle this logic within your application, and there is an example provided at example/aws/request/customRetryer on how to do this.

It could be beneficial to surface this error in a better way than wrapping it in the default awserr.Error, by providing it its own concrete error type that could be compared against. Right now, detecting it requires substring matching on the original error message by users, which is not ideal. We would welcome a PR that provides a helper utility for users looking to implement a custom retryer who wish to detect this error case and implement their own retry logic.

@skmcgrail skmcgrail added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 13, 2019

emil2k commented Nov 13, 2019

The logic behind this change is that the SDK is not able to sufficiently determine the state of the API request after successfully failing to write a request to the service, but failed to read the corresponding response due to a connection reset. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.

I'm sorry @skmcgrail, I don't quite understand these lines. If you are unable to determine the state of the API request after a failed write, why would you then assume that a failed write request is idempotent and retryable? And then why not make the same assumption for a read request?

@skmcgrail

The logic behind this change is that the SDK is not able to sufficiently determine the state of the API request after successfully failing to write a request to the service, but failed to read the corresponding response due to a connection reset. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.

I'm sorry @skmcgrail, I don't quite understand these lines. If you are unable to determine the state of the API request after a failed write, why would you then assume that a failed write request is idempotent and retryable? And then why not make the same assumption for a read request?

Apologies, it looks like I missed a typo that led to some confusion. Here is a corrected statement of the issue:
The logic behind this change is that the SDK is not able to sufficiently determine the state of an API request after successfully writing the request to the transport layer, but then failing to read the corresponding response due to a connection reset occurring. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.


emil2k commented Nov 18, 2019

Thank you for the explanation @skmcgrail. I get your reasoning now, but ideally this would be handled somewhat like Stripe does, by adding an idempotency layer to the API.

It could be beneficial to surface this error in a better way than wrapping it in the default awserr.Error, by providing it its own concrete error type that could be compared against. Right now, detecting it requires substring matching on the original error message by users, which is not ideal. We would welcome a PR that provides a helper utility for users looking to implement a custom retryer who wish to detect this error case and implement their own retry logic.

I think there is a syscall.Errno type nested inside the *url.Error returned by these failed requests which can be used to determine the specific syscall error. Here is a patch to the test I added in this PR to produce a connection reset error which shows the nesting:

diff --git a/aws/request/request_retry_test.go b/aws/request/request_retry_test.go
index 3c80614d..5cd99307 100644
--- a/aws/request/request_retry_test.go
+++ b/aws/request/request_retry_test.go
@@ -7,6 +7,7 @@ import (
        "net/http/httptest"
        "net/url"
        "os"
+       "syscall"
        "testing"
        "time"
 
@@ -149,6 +150,16 @@ func TestShouldRetryError_connection_reset(t *testing.T) {
 
        debugerr(t, err)
 
+       t.Logf("error type: %T", err)
+       uerr := err.(*url.Error)
+       t.Logf("url error type: op: %s, %T", uerr.Op, uerr.Err)
+       oerr := uerr.Err.(*net.OpError)
+       t.Logf("op error type: op: %s, %T", oerr.Op, oerr.Err)
+       serr := oerr.Err.(*os.SyscallError)
+       t.Logf("syscall error type: syscall: %s, %T", serr.Syscall, serr.Err)
+       nerr := serr.Err.(syscall.Errno)
+       t.Logf("errno: %s, is reset: %t", nerr, nerr == syscall.ECONNRESET)
+
        if shouldRetryError(err) == false {
                t.Errorf("this connection was reset by peer and should be retried")
        }

And the results of running the test show the nested error types with a syscall.Errno after unwrapping a few levels:

=== RUN   TestShouldRetryError_connection_reset
--- PASS: TestShouldRetryError_connection_reset (0.00s)
    request_retry_test.go:195: Error, Get http://127.0.0.1:58659: read tcp 127.0.0.1:58660->127.0.0.1:58659: read: connection reset by peer
    request_retry_test.go:199: Get http://127.0.0.1:58659: read tcp 127.0.0.1:58660->127.0.0.1:58659: read: connection reset by peer is a temporary error: false
    request_retry_test.go:153: error type: *url.Error
    request_retry_test.go:155: url error type: op: Get, *net.OpError
    request_retry_test.go:157: op error type: op: read, *os.SyscallError
    request_retry_test.go:159: syscall error type: syscall: read, syscall.Errno
    request_retry_test.go:161: errno: connection reset by peer, is reset: true
PASS
ok  	github.com/aws/aws-sdk-go/aws/request	0.032s

I would also recommend renaming or splitting the utility function isErrConnectionReset, which is used in a couple of places but whose name does not describe what it actually does: detect specifically a write connection reset or a broken pipe error.

At the moment, I don't have the time to implement the suggested PR, but I'll reference this one if I do in the future.

@emil2k emil2k closed this Nov 18, 2019
nelhage added a commit to nelhage/s5cmd that referenced this pull request Jun 14, 2021
Per
aws/aws-sdk-go#2926 (comment),
the upstream SDK doesn't retry connection reset errors on reads,
because at that point we can't tell if the server has already applied
the operation.

For S3 operations, I think pretty much all requests should be safely
idempotent, so I think we should retry these errors.

I saw some of my own long-running jobs fail this weekend due to
network weather and some dropped connections that s5cmd did not retry,
which I found very surprising.

Per
aws/aws-sdk-go@c3d2710,
the `strings.Contains` check does seem to be the best way to check for
this error.
igungor pushed a commit to peak/s5cmd that referenced this pull request Jun 16, 2021

Fixes #294
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Jun 28, 2022
This change implements a custom retryer that we use when
initializing a new s3 client for interaction with the external
storage sink. This change was motivated by the increased number
of backup job failures we were observing with a `read: connection reset`
error being thrown by s3.

A read connection reset error is thrown when the SDK is unable to read
the response of an underlying API request due to a connection reset. The
DefaultRetryer in the AWS SDK does not treat this error as a retryable error
since the SDK does not have knowledge about the idempotence of the request,
and whether it is safe to retry -
aws/aws-sdk-go#2926 (comment).
In CRDB all operations with s3 (read, write, list) are considered idempotent,
and so we can treat the read connection reset error as retryable too.

Release note (bug fix): Retry s3 operations when they error out with a
read connection reset error instead of failing the top level job.
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Jun 29, 2022
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Jun 29, 2022
…83526 #83533 #83548 #83559 #83576 #83588

57838: execinfra: remove MetadataTest* processors r=yuzefovich a=yuzefovich

This commit removes `MetadataTestSender` and `MetadataTestReceiver`
processors since they no longer provide much value. I believe they were
introduced when we added a `ProducerMetadata` as a return parameter to
`Next` method in order to ensure that at least some artificial metadata
is propagated correctly throughout the whole flow.

The main goal of this commit is the removal of `fakedist-metadata` and
`5node-metadata` logic test configs in order to speed up the CI time.

The justification for removal of these processors without putting anything
in their place is that these processors are not that useful - the only
thing they can test is that *some* metadata is propagated through the
row-based flows. Note that they don't test whether all necessary
metadata is emitted (for example, whether `LeafTxnFinalState`). We've
been using the vectorized engine as the default for a while now, and these
processors don't get planned with the vectorized flows. Thus, it seems
silly to have a logic test config like `fakedist-metadata` that is part
of the default configs.

Addresses: #57268.

Release note: None

78104: kvserver: include MVCC range keys in replica consistency checks r=aliher1911 a=erikgrinaker

This patch adds handling of MVCC range keys in replica consistency
checks. These are iterated over as part of `MVCCStats` calculations and
hashed similarly to point keys.

Range keys will only exist after the version gate
`ExperimentalMVCCRangeTombstones` has been enabled, so a separate
version gate is not necessary.

Release note: None

81880: changefeedccl: fix changefeed telemetry resolved r=samiskin a=samiskin

Resolves #81599

Our internal-use changefeed create / failed telemetry events would
incorrectly record "yes" for any non-no value of resolved, rather than
using the value that was passed in.  This change resolves that, emitting
yes for "resolved" and the value itself for "resolved=<value>".

Release note: None

82686: sql: support ttl_expiration_expression for row-level TTL r=otan a=ecwall

refs #76916

Release note (sql change): Allow `CREATE TABLE ... WITH (ttl_expiration_expression='...')`.
Allow `ALTER TABLE ... SET (ttl_expiration_expression='...')` and
`ALTER TABLE ... RESET (ttl_expiration_expression)`. ttl_expiration_expression accepts
an expression that returns a timestamp to support custom TTL calculation.

82885: streamingccl: minor error style cleanups r=miretskiy,HonoreDB a=stevendanna

Capitalized error messages are rare in the code base, so I've made
these more consistent with the rest of the code base.

Release note: None

83004: roachprod: add prometheus/grafana monitoring r=irfansharif a=msbutler

Previously, only roachtests could spin up prom/grafana servers that lasted the
lifetime of the roachtest. This PR introduces new roachprod cmds that allow
a roachprod user to easily spin up/down their own prom/grafana instances. The PR
also hooks up roachtests that rely on prom/grafana into this new infrastructure.

Release note: none

83027: sql: add plan gist to sampled query telemetry log r=THardy98 a=THardy98

Partially resolves: #71328

This change adds a plan gist field to the sampled query telemetry log.
The plan gist is written as a base64 encoded string.

Release note (sql change): The sampled query telemetry log now includes
a plan gist field. The plan gist field provides a compact representation
of a logical plan for the sampled query, the field is written as a
base64 encoded string.

83445: roachtest: skip decommissionBench/nodes=8/cpu=16/warehouses=3000 r=erikgrinaker a=tbg

#82870

Release note: None


83505: streamingccl: deflake TestPartitionedStreamReplicationClient r=miretskiy a=stevendanna

Previously, this test would fail occasionally with:

```
    partitioned_stream_client_test.go:192:
                Error Trace:    partitioned_stream_client_test.go:192
                Error:          Target error should be in err chain:
                                expected: "context canceled"
                                in chain: "pq: query execution canceled"
                Test:           TestPartitionedStreamReplicationClient
```

This is a result of the lib/pq library asynchronously sending a
CancelRequest when it notices the context cancellation. The cancel
request may result in the query ending before the local context
cancellation produces an error.

We now count this query cancellation error as an acceptable response
since the goal of the test is to assert that we respond to context
cancellation.

Fixes #76919

Release note: None

83526: backupccl: create tree.SystemUsers, a new DescriptorCoverage enum r=adityamaru a=msbutler

Previously during planning and execution RESTORE SYSTEM USERS was identified by
a `jobDetails` field. This refactor now identifies this flavor of
restore with a new DescriptorCoverage enum value, `tree.SystemUsers`.

This refactor eases the logic around exposing extra processing steps for
flavors of backup/restore that target different sets of descriptors.

Release note: None

83533: amazon: add custom retryer to retry on `read: connection reset` r=miretskiy,stevendanna a=adityamaru

This change implements a custom retryer that we use when
initializing a new s3 client for interaction with the external
storage sink. This change was motivated by the increased number
of backup job failures we were observing with a `read: connection reset`
error being thrown by s3.

A read connection reset error is thrown when the SDK is unable to read
the response of an underlying API request due to a connection reset. The
DefaultRetryer in the AWS SDK does not treat this error as a retryable error
since the SDK does not have knowledge about the idempotence of the request,
and whether it is safe to retry -
aws/aws-sdk-go#2926 (comment).
In CRDB all operations with s3 (read, write, list) are considered idempotent,
and so we can treat the read connection reset error as retryable too.

Release note (bug fix): Retry s3 operations when they error out with a
read connection reset error instead of failing the top level job.

83548: changefeedccl: Support more stable functions. r=miretskiy a=miretskiy

Add support for additional stable functions to CDC expressions.

Fixes #83466

Release Notes: None

83559: sql: ensure that the plan is closed in apply joins in some error cases r=yuzefovich a=yuzefovich

Previously, it was possible for the apply join's plan to be left
unclosed when an error is encountered during the physical planning of
the main query, and this has now been fixed. We do so by explicitly
closing the plan in such a scenario.

Fixes: #82705.
Fixes: #83368.

Release note: None

83576: streamingccl: re-enabled TestRandomClientGeneration r=miretskiy a=stevendanna

TestRandomClientGeneration was skipped in #61292 as a flake. However,
in the time since then, other changes in this code broke this test
more completely. Re-enabling the test required a few unrelated
changes:

- The stream ingestion processor required a fully formed job to be
  able to poll the cutover time. Now, test code can set a
  cutoverProvider that doesn't depend on a full job record.

- The request intercepting depended on an explicit client being
  set. This test was rather passing the processor a randgen URI. Now we
  pass the client explicitly and also update the test code to make it
  clear that the stream URI isn't actually used for anything.

- The code was attempting to validate the number of rows using SQL. I
  haven't dug into how this was working in the past. But as we are
  connecting to the host tenant and the keys are being ingested to a
  guest tenant, we would need a connection to the guest tenant to
  validate the table data. I've simply removed this assertion since I
  don't think it was testing very much compared to the KV level
  assertions also used in the test.

- The test code assumed that the partitions were keyed based on the
  subscription token rather than the subscription ID.

It isn't clear what the original source of the flakiness was.
However, the test has run a few hundred times under stress without
issue.

Alternatively, we could just delete this test.

Fixes #61287

Release note: None

83588: tree: DROP ROLE should not implement CCLOnlyStatement r=rafiss a=rafiss

This was moved out of CCL licensing a few releases ago.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
adityamaru added a commit to cockroachdb/cockroach that referenced this pull request Nov 7, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Dec 5, 2022